Description of Data
Three datasets are used in this study, two representing the supply side and one representing the demand side. Below we go through the source of each dataset and some noteworthy features of the data.
Supply - Stack Overflow 2018 Developer Survey
This is a survey conducted by Stack Overflow in 2018 (data can be found here) of 98,855 developers, asking how they learn, build their careers, which tools they use, and what they want in a job. Since our study focuses on the field of data science, we filtered the data to developers with the title 'Data or business analyst' or 'Data scientist or machine learning specialist'.
We use this data as representative of data scientists in the U.S. job market.
#read the file
stackoverflow <- read_csv("C:/Users/Flora Huang/Downloads/developer_survey_2018/survey_results_public.csv")
stackoverflow <- stackoverflow %>% select(`LanguageWorkedWith`,`DevType`,`DatabaseWorkedWith`,`PlatformWorkedWith`,`Country`,`ConvertedSalary`,`Employment`)
Only the columns selected above are used in the analysis of this dataset. The LanguageWorkedWith, DatabaseWorkedWith, and PlatformWorkedWith columns contain exactly what their names suggest, with multiple answers separated by semicolons. DevType holds developer-type data, which we used to filter for data scientists by matching the phrases 'Data or business analyst' and 'Data scientist or machine learning specialist'. The Country column contains country names, the ConvertedSalary column contains salaries all converted to USD, and the Employment column shows whether a person is a full-time or part-time employee. We focus only on full-time employees.
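To make the semicolon-separated format concrete, here is a minimal sketch of how a multi-select answer can be split into one row per item (the toy tibble below is hypothetical, not actual survey data):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical two-respondent sample in the survey's multi-select format
toy <- tibble(Respondent = c(1, 2),
              LanguageWorkedWith = c("Python;R;SQL", "R;Scala"))

# Split each cell on the semicolon, then unnest to one language per row
long <- toy %>%
  mutate(LanguageWorkedWith = str_split(LanguageWorkedWith, pattern = ";")) %>%
  unnest(LanguageWorkedWith)

nrow(long) # 5 rows: three languages for respondent 1, two for respondent 2
```

This is the same str_split/unnest pattern used in the analysis code in this report.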
Some noteworthy features of Stack Overflow 2018 Developer Survey:
In a basic analysis of Stack Overflow’s survey, we found some noteworthy and interesting features of the data.
- A. Most Used Databases by U.S. Data Scientists
#This code draws a bar chart of the most used databases, in decreasing order.
#Filters full-time workers, data scientists, US workers, and their missing data
database_hist = stackoverflow %>%
filter(Employment %in% 'Employed full-time') %>% #filter full-time employees
filter(!is.na(DatabaseWorkedWith)) %>% #filter out missing data
filter(!is.na(DevType)) %>%
filter(DevType %in% c('Data or business analyst','Data scientist or machine learning specialist')) %>% #filter data scientists
filter(Country %in% 'United States') %>% #filter US workers
select(DatabaseWorkedWith) %>%
mutate(DatabaseWorkedWith = str_split(DatabaseWorkedWith, pattern = ";")) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith) %>%
summarise(Count = n()) %>% #count by database
arrange(desc(Count)) %>% #reorder in descending order
ungroup() %>%
mutate(DatabaseWorkedWith = reorder(DatabaseWorkedWith, Count))
#keep only the top 20 databases
database_hist = slice(database_hist, 1:20)
#plot as a bar chart
highchart() %>%
hc_title(text = "Most Used Databases by U.S. Data Scientists") %>% #title
hc_xAxis(categories = database_hist$DatabaseWorkedWith) %>% #x axis
hc_add_series(data = database_hist$Count, name = "Count", type = "bar") #plot
This chart shows which databases data scientists use most. The data are filtered to full-time U.S. data scientists, which leaves 259 rows. The x-axis shows the count, and the y-axis lists the databases sorted in decreasing order of count. It is important to note that the filtered data pertain to data scientists only and do not include database managers. The results show that data scientists tend to prefer the most popular databases, with SQL-based databases clearly dominant.
- B. Most Used Platforms by U.S. Data Scientists
#This code draws a bar chart of the most used platforms, in decreasing order.
#Filters full-time workers, data scientists, US workers, and their missing data
platform_hist = stackoverflow %>%
filter(Employment %in% 'Employed full-time') %>% #filter full-time employees
filter(!is.na(PlatformWorkedWith)) %>% #filter out missing data
filter(!is.na(DevType)) %>%
filter(DevType %in% c('Data or business analyst','Data scientist or machine learning specialist')) %>% #filter data scientists
filter(Country %in% 'United States') %>% #filter US workers
select(PlatformWorkedWith) %>%
mutate(PlatformWorkedWith = str_split(PlatformWorkedWith, pattern = ";")) %>%
unnest(PlatformWorkedWith) %>%
group_by(PlatformWorkedWith) %>%
summarise(Count = n()) %>% #count by platform
arrange(desc(Count)) %>% #reorder in descending order
ungroup() %>%
mutate(PlatformWorkedWith = reorder(PlatformWorkedWith, Count))
#keep only the top 20 platforms
platform_hist = slice(platform_hist, 1:20)
#plot
highchart() %>%
hc_title(text = "Most Used Platforms by U.S. Data Scientists") %>% #title
hc_xAxis(categories = platform_hist$PlatformWorkedWith) %>% #x axis
hc_add_series(data = platform_hist$Count, name = "Count", type = "bar") #plot
This chart shows platform usage by data scientists. Again, the data are filtered to full-time U.S. data scientists with non-missing platform data, leaving 200 rows. The x-axis shows the count, and the y-axis lists the platforms sorted in decreasing order of count. There is nothing particularly surprising about this result: the three most widely used platforms overall, Linux, macOS, and Windows, are also the most popular platforms among data scientists.
Supply - Kaggle ML and Data Science Survey, 2018
This is an industry-wide survey of 23,859 data scientists and machine learning engineers conducted by Kaggle (data can be found here). The survey looks into who is working with data, what is happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field, which matches the objective of our study.
We use this data as representative of data scientists in the international job market.
#read data
#read data
df_survey_mcq <- as_tibble(fread("C:/Users/Flora Huang/Desktop/Exploratory Data Analysis and Visualization/Final project/code/data/multipleChoiceResponses.csv", na.strings = "-1"))
df_survey_free <- as_tibble(fread("C:/Users/Flora Huang/Desktop/Exploratory Data Analysis and Visualization/Final project/code/data/freeFormResponses.csv", na.strings = "-1"))
While the survey covers 228 features of respondents' jobs, we select only the features we are interested in, which are:
- In which country do you currently reside?
- What is your gender?
- What is your gender? - Prefer to self-describe - Text
- What is your age (# years)?
- What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
- Which best describes your undergraduate major? - Selected Choice
- Select the title most similar to your current role (or most recent title if retired): - Selected Choice
- Select the title most similar to your current role (or most recent title if retired): - Other - Text
- What is the type of data that you currently interact with most often at work or school? - Selected Choice
- What specific programming language do you use most often? - Selected Choice
- What is your current yearly compensation (approximate $USD)?
- How long have you been writing code to analyze data?
- For how many years have you used machine learning methods (at work or in school)?
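Because the raw column names are these full question texts, the features can be pulled out with backticked names in select(). A minimal sketch (the two toy rows below are hypothetical stand-ins for the real responses):

```r
library(dplyr)

# Hypothetical stand-in whose columns are full question texts, as in the survey
responses <- tibble(
  `In which country do you currently reside?` = c("United States of America", "India"),
  `What is your gender?` = c("Female", "Male"),
  `What is your age (# years)?` = c("25-29", "22-24"))

# Question-text column names must be backticked when selected
selected <- responses %>%
  select(`In which country do you currently reside?`,
         `What is your age (# years)?`)

ncol(selected) # 2
```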
Some noteworthy features of Kaggle ML and Data Science Survey, 2018:
In a basic analysis of Kaggle ML and Data Science Survey, 2018, we found some noteworthy and interesting features of the data.
df_survey_mcq <- as_tibble(fread("C:/Users/Flora Huang/Desktop/Exploratory Data Analysis and Visualization/Final project/code/data/multipleChoiceResponses.csv", skip = 1))
df_survey_free <- as_tibble(fread("C:/Users/Flora Huang/Desktop/Exploratory Data Analysis and Visualization/Final project/code/data/freeFormResponses.csv", skip = 1))
#clean the country variable to analyze it
df_survey_mcq <- df_survey_mcq %>%
mutate(country = `In which country do you currently reside?`)
pop <- df_survey_mcq %>%
count(country) %>%
filter(!(country %in% c("Other", "I do not wish to disclose my location"))) %>%
mutate(iso3 = countrycode(country, origin = "country.name", destination = "iso3c"))
- B. Where do the survey’s respondents reside?
#plot the country variable
df_survey_mcq %>%
count(country) %>%
arrange(desc(n)) %>% #order countries by number of respondents
head(10) %>%
ggplot(aes(reorder(country, n, FUN = min), n)) +
geom_col(fill="#79AAEF") +
labs(x = "", y = "Number of Respondents") +
theme(legend.position = "none") +
ggtitle("Country of Residence: US & India dominate") +
coord_flip()

Insights:
We can see that the survey data come mainly from people in the U.S. and India. Africa is very underrepresented, so we might not be able to draw many conclusions for the African continent.
Demand - Rachel’s Mail - Columbia University Data Science Career Opportunities
Students in Columbia University's Data Science master's program are no strangers to "Rachel's Mail", emails sent out by our Assistant Director of Student Services & Career Development, Ms. Rachel Fuld Cohen. Ms. Cohen emails all DSI students when new data science jobs are listed, and her emails are one of the primary ways for DSI students to apply for summer internships or full-time jobs. We use this data as our demand-side data, helping us understand what the job market is looking for in prospective candidates.
We scraped all of Rachel's emails tagged with "Career Opportunities" using Mozilla Thunderbird (credentials such as Columbia student UNI accounts are needed), which can be done fairly easily.
In our first semester (2018/8/24 - 2018/11/29), Rachel sent out 30 career opportunity emails, which in total listed 82 job offers.
The features we studied from the data were:
- General info: Company’s name, Job title, Location, Industry
- Job listing time: Date and days the emails were sent to students
- How to apply: Email to HR, Apply on company’s websites, On-campus interviews
- Job type: Internships, Full-time jobs
- Description: The requirements of applicants listed on the job posting
rachel <- read_csv("C:/Users/Flora Huang/Desktop/Rachel Email/rachel.csv")
Some noteworthy features of Rachel’s career opportunity emails:
In a basic analysis of Rachel’s email, we found some noteworthy and interesting features of the data.
- A. On which days are the career opportunity emails sent?
rachel$Day <- factor(rachel$Day, levels= c("Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
rachel %>% count(Day) %>% hchart(type = "column", hcaes(x = Day, y = n)) %>% hc_title(text="On which days are the career emails sent")
We can see from the graph that most of Rachel's career opportunity emails are sent out on Friday. Therefore, a student actively seeking job opportunities should pay particular attention on Fridays to be among the first applicants.
- B. On which dates are the career opportunity emails sent?
rachel$Date <- as.Date(rachel$Date, "%m/%d/%Y")
rachel %>% count(Date) %>% hchart(type = "column", hcaes(x = Date, y = n)) %>% hc_title(text="On which dates are the career emails sent")
We can see that the highest frequency of career opportunity emails falls between early October and mid-November, which was the fall recruitment season and when the Data Science Institute held its own career fair. The emails were fairly evenly distributed across the rest of our research period.
- C. How can we submit the job applications through career opportunities emails?
rachel$Apply <- factor(rachel$Apply)
rachel %>% count(Apply) %>% hchart(type = "column", hcaes(x = Apply, y = n)) %>% hc_title(text="Ways of applying for jobs through career emails")
We can see that the main ways to apply for jobs through the career opportunity emails are emailing HR and applying on the company's website. In some cases, the company also comes to Columbia's campus to hold on-campus interviews with students.
- D. Where are the jobs in the career opportunity emails based?
#Analyze the locations
job_location <- data.frame(
Location = c("San Francisco", "Boston", "Washington", "New York"),
Jobs = c(sum(str_count(rachel$Location, "San Francisco")),
sum(str_count(rachel$Location, "Boston")),
sum(str_count(rachel$Location, "Washington")),
sum(str_count(rachel$Location, "New York"))))
#Plot
highchart() %>% hc_xAxis(type = 'category') %>% hc_add_series(job_location, "column", hcaes(x = Location, y = Jobs)) %>% hc_title(text="Location of jobs in the career emails")
We can see from the graph that almost all the jobs in the career opportunity emails are based in New York, which is reasonable given Columbia University’s location.
- E. What industries are the jobs in the career opportunities emails in?
rachel %>% count(Industry) %>% arrange(n) %>% hchart(type = "column", hcaes(x = Industry, y = n)) %>% hc_title(text="Industries of jobs in the career emails")
We can see from the graph that most jobs listed in the career opportunity emails come from the finance and technology sectors, with healthcare following as the third major industry. There are clear examples of how these industries have been implementing data science in their work, and the strong demand for data scientists there is a good sign for our classmates interested in building a career in them.
Executive summary (Presentation-style)
Age
We are interested in knowing the age distribution of data scientists in the job market.
foo <- df_survey_mcq %>%
group_by(age) %>%
count()
p2 <- foo %>%
mutate(percentage = str_c(as.character(round(n/sum(foo$n)*100,1)), "%")) %>%
ggplot(aes(age, n)) +
geom_col(fill="#79AAEF") +
labs(x = "Age Group", y = "Respondents") + theme_grey(16) +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
ggtitle("Age groups of international data scientists")
p2

We find that most respondents are in the 20-35 age range, showing that the field is dominated by the younger generation. This makes sense, as the field is relatively new, so people just graduating are more likely to be drawn to it.
Gender
We are interested in knowing the gender imbalance of data scientists in the job market.
foo <- df_survey_mcq %>%
group_by(gender) %>%
count()
p1 <- foo %>%
mutate(percentage = str_c(as.character(round(n/sum(foo$n)*100,1)), "%")) %>%
ggplot(aes(gender, n)) +
geom_col(fill="#79AAEF") +
labs(x = "Gender", y = "Respondents") + theme_grey(16) +
theme(legend.position = "none", axis.text.x = element_text(angle=15, hjust=1, vjust=0.9)) +
ggtitle("Gender imbalance of international data scientists")
p1

We can see a huge gender imbalance between male and female (and other) respondents, which is commonly the case in STEM fields.
Age by gender
#plot and analyze the age and gender variable
p3 <- df_survey_mcq %>%
filter(gender %in% c("Male", "Female")) %>%
ggplot(aes(age, fill = gender)) +
geom_bar() + theme_grey(12) +
labs(x = "", y = "Count") +
ggtitle("Age by Gender of international data scientists") + scale_fill_manual(values=cbPalette)
p3

Comparing age by gender, we see that the sex ratio is more balanced among the younger generations than the older ones, suggesting that more women are now opting for data science as a career path.
Education
Education in top 6 fields
#visualize the education level of respondents
foo <- df_survey_mcq %>%
filter(!is.na(major)) %>%
group_by(major) %>%
count() %>%
ungroup() %>%
top_n(6, n)
edu_labs <- c("Doctoral", "Professional", "Master's", "Bachelor's", "Others") #avoid shadowing ggplot2::labs
df_survey_mcq %>%
filter(!is.na(edu) & edu != "No answer") %>%
semi_join(foo, by = "major") %>%
count(edu, major) %>%
ggplot(aes(edu, n)) +
geom_col(fill="#79AAEF") +
theme(legend.position = "none",
axis.text.x = element_text(angle=35, hjust=1, vjust=0.9),
strip.text.x = element_text(size = 7)) +
guides(fill = guide_legend(ncol = 2)) +
labs(x = "", y = "Number of Respondents") +
facet_wrap(~ major, ncol = 2, scales = "free_y") +
ggtitle("Education in top 6 fields of international data scientists") + scale_x_discrete(labels = edu_labs)

We see that a master's is the most common degree across all undergraduate majors, though many PhDs are also in the industry. We also notice that PhDs are more common among those entering the industry from a physics major.
Demand – General Skills Wanted in Job Listings
We took the general skills listed in Jeff Hale's study and searched for their occurrences in the job listing requirements from the Columbia University Data Science career opportunity emails.
#Keyword Analysis
desc <- tolower(rachel$Description) #lowercased descriptions for case-insensitive matching
keywords <- data.frame(
General_Skills = c("Analysis", "Machine Learning", "Statistics", "Computer Science",
"Communication", "Mathematics", "Visualization", "AI", "Deep Learning",
"NLP", "Software Development", "Neural Network", "Project Management",
"Software Engineering", "Data Engineering"),
Count = c(length(grep("analysis", desc)),
length(grep("machine learning", desc)),
length(grep("statistics", desc)),
length(grep("computer science", desc)),
length(grep("communication", desc)),
length(grep("mathematics", desc)),
length(grep("visualization", desc)),
length(grep("AI", rachel$Description)) + length(grep("artificial intelligence", desc)),
length(grep("deep learning", desc)),
length(grep("NLP", rachel$Description)) + length(grep("natural language processing", desc)),
length(grep("software development", desc)),
length(grep("neural network", desc)),
length(grep("project management", desc)),
length(grep("software engineering", desc)),
length(grep("data engineering", desc))))
#Plot
keywords %>% arrange(desc(Count)) %>% hchart(type = "column", hcaes(x = General_Skills, y = Count)) %>% hc_xAxis(type = 'category') %>% hc_title(text="General Skills in Data Scientist Job Listings")
Our results show that statistical analysis, communication, and machine learning are at the heart of data scientist jobs, which matches Jeff Hale's study of major job listing websites. The result matches our expectations, since the primary function of data science is to use statistical analysis to draw useful insights from data. Machine learning, together with related terms such as AI and deep learning, also shows up frequently, since these are the main techniques in data science for building predictive systems and are in high demand.
It is also noteworthy that communication tops the rankings in both studies' job listing descriptions. This tells us it is very important for data scientists to be able to communicate insights and work with others.
Demand – Technical Skills Wanted in Job Listings
We took the technical skills listed in Jeff Hale's study and searched for their occurrences in the job listing requirements from the Columbia University Data Science career opportunity emails.
#keywords
desc <- tolower(rachel$Description) #lowercased descriptions for case-insensitive matching
skills <- data.frame(
Skills = c("Python", "R", "SQL", "Hadoop", "Spark", "Java", "SAS", "Tableau", "Hive",
"Scala", "AWS", "C++", "MATLAB", "Tensorflow", "Excel", "Linux", "Azure", "scikit-learn"),
Count = c(length(grep("python", desc)),
length(grep("R", rachel$Description)),
length(grep("SQL", rachel$Description)),
length(grep("hadoop", desc)),
length(grep("spark", desc)),
length(grep("java", desc)),
length(grep("SAS", rachel$Description)),
length(grep("tableau", desc)),
length(grep("hive", desc)),
length(grep("scala", desc)),
length(grep("AWS", rachel$Description)),
length(grep("C\\+\\+", rachel$Description)),
length(grep("matlab", desc)),
length(grep("tensorflow", desc)),
sum(str_count(desc, "\\bexcel\\b")),
length(grep("linux", desc)),
sum(str_count(desc, "\\bazure\\b")),
length(grep("scikit-learn", desc))))
#plot
skills %>% arrange(desc(Count)) %>% hchart(type = "column", hcaes(x = Skills, y = Count)) %>% hc_xAxis(type = 'category') %>% hc_title(text="Top 20 technology skills in Data Scientist Job Listings")
Our results show that R, Python, and SQL are the most demanded technical skills in data scientist jobs, which matches Jeff Hale's study of major job listing websites. Python and R are not far off from each other and dominate in frequency, which makes the two languages a must for virtually every data scientist position. SQL is also in high demand. SQL stands for Structured Query Language and is the primary way to interact with relational databases.
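One caveat about this keyword count: a pattern like grep("R", ...) matches any description containing a capital R anywhere (for example inside "REST"), so one-letter skills can be overcounted. A hedged sketch of a stricter word-boundary match, using made-up descriptions rather than the actual emails:

```r
library(stringr)

# Hypothetical job descriptions, not from the actual emails
descriptions <- c("Experience with R and Python required",
                  "Strong REST API background")

# grepl("R", ...) also matches "REST", so both rows count
loose <- sum(grepl("R", descriptions))
# "\\bR\\b" matches "R" only as a standalone word
strict <- sum(str_count(descriptions, "\\bR\\b"))

loose  # 2
strict # 1
```

The same \\b word-boundary approach is already used above for Excel and Azure, whose names can also occur inside longer words.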
Demanded Technical Skills: Internship vs. Full-Time
We are also interested in whether the technical skills wanted in job listings vary by job type. In the Columbia University Data Science career opportunity emails, internship and full-time positions are listed together. We want to know whether the industry looks for different skills in the two types of applicants.
#Filter data set
rachel_intern <- rachel[rachel$Type == "Internship",]
rachel_full <- rachel[rachel$Type == "Full Time",]
#Count each technical skill's occurrences in a set of job descriptions
skill_counts <- function(descriptions) {
desc <- tolower(descriptions)
data.frame(
Skills = c("Python", "R", "SQL", "Hadoop", "Spark", "Java", "SAS", "Tableau", "Hive",
"Scala", "AWS", "C++", "MATLAB", "Tensorflow", "Excel", "Linux", "Azure", "scikit-learn"),
Count = c(length(grep("python", desc)),
length(grep("R", descriptions)),
length(grep("SQL", descriptions)),
length(grep("hadoop", desc)),
length(grep("spark", desc)),
length(grep("java", desc)),
length(grep("SAS", descriptions)),
length(grep("tableau", desc)),
length(grep("hive", desc)),
length(grep("scala", desc)),
length(grep("AWS", descriptions)),
length(grep("C\\+\\+", descriptions)),
length(grep("matlab", desc)),
length(grep("tensorflow", desc)),
sum(str_count(desc, "\\bexcel\\b")),
length(grep("linux", desc)),
sum(str_count(desc, "\\bazure\\b")),
length(grep("scikit-learn", desc))))
}
#Internship
skills_intern <- skill_counts(rachel_intern$Description)
#Full-Time
skills_full <- skill_counts(rachel_full$Description)
Top 20 technology skills in Data Scientist Internship Job Listings
skills_intern %>% arrange(desc(Count)) %>% hchart(type = "column", hcaes(x = Skills, y = Count)) %>% hc_xAxis(type = 'category') %>% hc_title(text="Top 20 technology skills in Data Scientist Internship Job Listings")
Top 20 technology skills in Data Scientist Full Time Job Listings
skills_full %>% arrange(desc(Count)) %>% hchart(type = "column", hcaes(x = Skills, y = Count)) %>% hc_xAxis(type = 'category') %>% hc_title(text="Top 20 technology skills in Data Scientist Full Time Job Listings")
We can see that R, Python, and SQL are still the top three demanded technical skills, but the ranking differs slightly by job type. The ranking in full-time job listings matches our previous findings, whereas R has a slight edge over Python in internship listings. Internship positions also require fewer technical skills than full-time positions.
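To inspect the difference directly, the two count tables can also be joined side by side. A sketch of that comparison (the toy frames below stand in for the skills_intern and skills_full frames built above; the numbers are made up):

```r
library(dplyr)

# Toy stand-ins for the skills_intern / skills_full frames computed above
skills_intern <- data.frame(Skills = c("Python", "R", "SQL"), Count = c(5, 6, 3))
skills_full   <- data.frame(Skills = c("Python", "R", "SQL"), Count = c(20, 18, 15))

# Join the two counts on the skill name for a direct comparison
skills_compare <- skills_intern %>%
  rename(Internship = Count) %>%
  left_join(rename(skills_full, FullTime = Count), by = "Skills") %>%
  arrange(desc(FullTime))

skills_compare$Skills[1] # "Python", the top full-time skill in this toy data
```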
Supply – Technical Skills People Have (USA data)
We used the Stack Overflow 2018 Developer Survey to analyze the technical skills U.S.-based data scientists have. The analysis filters to full-time U.S. data scientists and removes missing data.
#This code draws a bar chart of the most used languages, in decreasing order.
#Filters full-time workers, data scientists, US workers, and their missing data
language_hist = stackoverflow %>%
filter(Employment %in% 'Employed full-time') %>% #filter full-time employees
filter(!is.na(LanguageWorkedWith)) %>% #filter out missing data
filter(!is.na(DevType)) %>%
filter(DevType %in% c('Data or business analyst','Data scientist or machine learning specialist')) %>% #filter data scientists
filter(Country %in% 'United States') %>% #filter US workers
select(LanguageWorkedWith) %>%
mutate(LanguageWorkedWith = str_split(LanguageWorkedWith, pattern = ";")) %>%
unnest(LanguageWorkedWith) %>%
group_by(LanguageWorkedWith) %>%
summarise(Count = n()) %>% #count by language
arrange(desc(Count)) %>% #reorder in descending order
ungroup() %>%
mutate(LanguageWorkedWith = reorder(LanguageWorkedWith, Count))
#keep only the top 20 languages
language_hist = slice(language_hist, 1:20)
highchart() %>%
hc_title(text = "Most Used Languages by U.S. Data Scientists") %>% #title
hc_xAxis(categories = language_hist$LanguageWorkedWith) %>% #x axis
hc_add_series(data = language_hist$Count, name = "Count", type = "bar") #plot
The purpose of this chart is to explore which languages data scientists in the U.S. use. The data are filtered to full-time data scientists working in the US, which consists of 317 rows. The x-axis shows the count, and the y-axis lists the languages sorted in decreasing order of count. Our result shows that most data scientists in the U.S. use Python, SQL, and R.
Supply – Technical Skills People Have (World Data)
Most popular programming language
#visualize the first popular language
foo <- df_survey_mcq %>%
filter(!(country %in% c("Other", "I do not wish to disclose my location"))) %>%
mutate(country = as.character(case_when(
country == "United Kingdom of Great Britain and Northern Ireland" ~ "UK",
country == "United States of America" ~ "USA",
country == "Viet Nam" ~ "Vietnam",
TRUE ~ as.character(country)
))) %>%
group_by(lang, country) %>%
filter(!is.na(lang)) %>%
count() %>%
arrange(desc(n)) %>%
group_by(country) %>%
slice(c(1))
world %>%
filter(region != "Antarctica") %>%
left_join(foo, by = c("region" = "country")) %>%
ggplot() +
geom_polygon(aes(x = long, y = lat, fill = lang, group = group), color = "white") +
coord_fixed(1.3) +
labs(fill = "") +
theme_void() +
theme(legend.position = "top") +
ggtitle("Primary Programming Language of international data scientists",
subtitle = "Python has conquered the world; New Zealand is the only R stronghold")

Second most popular programming language
#visualize 2nd popular language
foo <- df_survey_mcq %>%
filter(!(country %in% c("Other", "I do not wish to disclose my location"))) %>%
mutate(country = as.character(case_when(
country == "United Kingdom of Great Britain and Northern Ireland" ~ "UK",
country == "United States of America" ~ "USA",
country == "Viet Nam" ~ "Vietnam",
TRUE ~ as.character(country)
))) %>%
group_by(lang, country) %>%
filter(!is.na(lang)) %>%
count() %>%
arrange(desc(n)) %>%
group_by(country) %>%
slice(c(2))
world %>%
filter(region != "Antarctica") %>%
left_join(foo, by = c("region" = "country")) %>%
ggplot() +
geom_polygon(aes(x = long, y = lat, fill = lang, group = group), color = "white") +
coord_fixed(1.3) +
labs(fill = "") +
theme_void() +
theme(legend.position = "top") +
ggtitle("Secondary Programming Language of international data scientists")

Median Salary of Data Scientists by Country
by_country_salary <- stackoverflow %>% select(Country, ConvertedSalary, DevType) %>% filter(!is.na(DevType)) %>%
filter(DevType %in% c('Data or business analyst','Data scientist or machine learning specialist')) %>% #filter data scientists
mutate(ConvertedSalary=as.numeric(ConvertedSalary)) %>% filter(!is.na(Country)) %>% filter(!is.na(ConvertedSalary)) %>%
group_by(Country) %>% summarize(MedSalary = median(ConvertedSalary, na.rm=TRUE))
data(worldgeojson, package = "highcharter") #using highcharter
code <- countrycode(by_country_salary$Country, 'country.name', 'iso3c') #get country code
by_country_salary$iso3 <- code
by_country_salary$MedSalary <- round(by_country_salary$MedSalary/1000) #round
#plot
highchart() %>%
hc_add_series_map(worldgeojson, by_country_salary, value = "MedSalary", joinBy = "iso3") %>%
hc_colorAxis(stops = color_stops()) %>%
hc_legend(enabled = TRUE) %>%
hc_title(text = "Median Salary of full time data scientists by country") %>%
hc_tooltip(useHTML = TRUE, headerFormat = "",
pointFormat = "Country: {point.Country} / Median Salary: ${point.MedSalary}K") %>% hc_add_theme(hc_theme_google())
This world map shows the median salary of full-time data scientists by country. Brighter-colored countries have a higher median salary, and darker-colored countries a lower one. Countries without a color have no data.
We can see from the chart that USA has the highest median salary for data scientists, which is around $92k per year.
Median Salary by Programming Language You Know
#This code draws a bar chart of median salary by language used, in decreasing order.
#Filters full-time workers, Data Scientists, US workers, and their missing data
median_sal_lang = stackoverflow %>%
filter(Employment %in% 'Employed full-time') %>%
filter(!is.na(LanguageWorkedWith))%>%
filter(!is.na(DevType)) %>%
filter(DevType %in% c('Data or business analyst','Data scientist or machine learning specialist')) %>% filter(Country %in% 'United States') %>%
select(LanguageWorkedWith,ConvertedSalary) %>% #The data already converted salaries to USD
mutate(LanguageWorkedWith = str_split(LanguageWorkedWith, pattern = ";")) %>%
unnest(LanguageWorkedWith) %>%
group_by(LanguageWorkedWith) %>%
summarise(Median_Salary = median(ConvertedSalary,na.rm = TRUE)) %>% #summarise each language by salary
arrange(desc(Median_Salary))%>% #descending order
ungroup()%>%
mutate(LanguageWorkedWith = reorder(LanguageWorkedWith, Median_Salary))
#slice top 20 data
median_sal_lang = slice(median_sal_lang, 1:20)
#plot
highchart()%>%
hc_title(text = paste("Median salary of full-time data scientists by programming language used"))%>%
hc_xAxis(categories = median_sal_lang$LanguageWorkedWith)%>%
hc_add_series(data = median_sal_lang$Median_Salary, name = "Median Salary", type = "bar")
This chart shows the median salary by language used by data scientists; the data consist of 317 rows. It is interesting that newer languages such as Hack or Scala command a higher median salary than well-known languages such as Python or R. However, the median salary for both Python and R is $100k, which corresponds to the overall median salary of data scientists in the US.